Abstract

The  HPCchallenge  benchmark  suite  has  been  released  by 
the  DARPA  HPCS  program  to  help  define  the 
performance  boundaries  of  future  Petascale  computing 
systems.  The  suite  is  composed  of  several  well  known 
computational  kernels  (STREAM,  Top500,  FFT,  and 
RandomAccess)  that  span  high  and  low  spatial  and 
temporal  locality.  These  kernels  also  encompass  key 
aspects  of  embedded  signal  processing:  vector 
computations,  matrix  multiplies,  corner  turns  and  random 
selection operations. MATLAB® is the primary high-level
language  used  within  the  signal  processing  community  and 
is  increasingly  used  for  large  system  simulations  and 
quickly  processing  data  in  the  field.  The  pMatlab  parallel 
MATLAB  toolbox  provides  the  necessary  global  array 
semantics  to  allow  HPCchallenge  to  be  implemented.  The 
results  provide  a  unique  opportunity  to  probe  both  the 
relative  (pMatlab  vs.  MATLAB)  and  absolute  (pMatlab  vs. 
C/Fortran+MPI)  merits  of  pMatlab.  Specifically,  for  each 
kernel  in  HPCchallenge  we  examine  code  size,  maximum 
problem  size,  and  performance.  We  find  pMatlab  code  to 
be approximately 10x smaller than the equivalent C/MPI
code.  The  problem  sizes  possible  using  pMatlab  scale 
linearly with the number of processors (e.g., we are able to
FFT a 2^28-element complex vector on 16 CPUs), and are
comparable  to  the  corresponding  C/Fortran+MPI  code. 
Finally,  the  scalability  of  the  kernels  approaches  that  of  the 
C/Fortran+MPI  code. 

Introduction 

The  HPCchallenge 

The  DARPA  High  Productivity  Computing  Systems 
(HPCS)  program  has  initiated  a  fundamental  reassessment 
of  how  we  define  and  measure  performance, 
programmability,  portability,  robustness  and,  ultimately, 
productivity  in  the  HPC  domain  [1].  With  this  in  mind, 
HPCchallenge  is  designed  to  approximately  bound 
computations  of  high  and  low  spatial  and  temporal  locality 
for  Petascale  systems.  Figure  1  illustrates  the  approximate 
spatial/temporal  relationship  of  the  different  kernels  and 
the connections to important operations in the embedded
signal processing community. In addition, because
HPCchallenge consists of simple mathematical operations,
it provides a unique opportunity to look at language and
parallel programming model issues. This paper compares
traditional

MATLAB is a registered trademark of The MathWorks, Inc.

Figure 1: HPCchallenge kernels are plotted relative to
spatial and temporal locality.


The  pMatlab  Parallel  Toolbox 

The  pMatlab  toolbox  implements  global  array  semantics  in 
MATLAB.  pMatlab  provides  high-level  parallel  data 
structures  and  functions  without  removing  the  fast 
prototyping  capability  and  ease  of  use  for  which  MATLAB 
is  well  known  [2].  This  is  achieved  by  combining  operator 
and  function  overloading  with  the  concept  of  parallel  data 
and  task  mapping  to  provide  implicit  data  and 
computational  parallelism.  pMatlab  is  currently  being 
used  for  simulating  signal  processing  chains  and  for  rapid 
analysis  of  sensor  data  in  the  field.  The  implementation  of 
the  HPCchallenge  using  pMatlab  offers  a  means  for  more 
detailed  performance  analysis  of  pMatlab. 


Parallel  Implementation 

STREAM  consists  of  four  local  operations  performed  on 
distributed  vectors:  copy,  scaling,  addition,  and  scaling 
with  addition.  All  of  these  operations  are  important  in 
signal  and  image  processing.  The  STREAM  benchmark 
requires  no  interprocessor  communication  and  is 
implemented  using  simple  distributed  matrices. 
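
The four STREAM operations can be sketched in a few lines. The example below is a minimal serial illustration in plain Python, not the pMatlab implementation: the helper names (split_blocks, stream_local, stream_distributed) are hypothetical, and the "processors" are simulated by splitting each vector into contiguous blocks. It shows why no interprocessor communication is needed: each block can be processed on its own and the pieces simply concatenated.

```python
def split_blocks(vec, nprocs):
    """Split vec into nprocs contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(vec), nprocs)
    blocks, start = [], 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)
        blocks.append(vec[start:start + size])
        start += size
    return blocks


def stream_local(a, b, q):
    """Apply the four STREAM operations to two equally sized local vectors."""
    return {
        "copy": list(a),                               # copy
        "scale": [q * x for x in a],                   # scaling
        "add": [x + y for x, y in zip(a, b)],          # addition
        "triad": [x + q * y for x, y in zip(a, b)],    # scaling with addition
    }


def stream_distributed(a, b, q, nprocs):
    """Run the operations block by block; no block ever reads another block's data."""
    results = {"copy": [], "scale": [], "add": [], "triad": []}
    for local_a, local_b in zip(split_blocks(a, nprocs), split_blocks(b, nprocs)):
        local = stream_local(local_a, local_b, q)
        for name in results:
            results[name].extend(local[name])
    return results
```

Because every output element depends only on the input elements at the same index, the block-by-block result is identical to the result computed on the whole vectors.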

RandomAccess  is  designed  to  measure  the  random 
access  capabilities  of  a  computer  system.  This  is 
accomplished  by  effectively  computing  the  histogram  of  a 
random number generator, replacing the typical addition
update with a bit-level XOR operation. Random data access
and logical operations are standard “post detection” signal
processing operations.
RandomAccess  requires  dynamic  communications  among 
all  the  processors  and  is  implemented  using  parallel  sparse 
arrays. 
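
The XOR-update idea can be illustrated with a small serial sketch. The code below is an assumption-laden illustration in plain Python, not the pMatlab sparse-array implementation: the function names are hypothetical, and the shift-and-XOR generator is a simplified stand-in rather than the benchmark's reference generator. Each pseudo-random value selects a table slot with its low-order bits and is XORed into that slot.

```python
def random_stream(seed, count):
    """Yield `count` 64-bit values from a simple shift-and-XOR generator (illustrative)."""
    mask = (1 << 64) - 1
    ran = seed & mask
    for _ in range(count):
        high_bit = ran >> 63
        # Shift left one bit; fold in a small constant when the top bit falls off.
        ran = ((ran << 1) & mask) ^ (0x7 if high_bit else 0)
        yield ran


def xor_update(table, seed, num_updates):
    """XOR each generated value into the table slot chosen by its low-order bits."""
    size = len(table)
    if size == 0 or size & (size - 1):
        raise ValueError("table size must be a power of two")
    for ran in random_stream(seed, num_updates):
        table[ran & (size - 1)] ^= ran
    return table
```

Since XOR is its own inverse, running the same update sequence a second time restores the original table, which gives a convenient correctness check for any implementation of this kernel.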

The  Top500  Linpack  Benchmark  uses  an  LU  Solver  to 
solve  a  dense  linear  system  of  equations  such  as  Ax=b. 
Such  an  algorithm  requires  selecting  and  communicating 
arbitrary  parallel  sub-matrices  typical  of  many  dense  linear 
algebra  operations.  At  the  core  of  LU  are  matrix-matrix 
multiplies  typical  of  multi-element  beamforming 
operations. 
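
As a point of reference, the serial algorithm behind this kernel can be sketched as follows. This is a minimal unblocked LU factorization with partial pivoting in plain Python, written for illustration only; the function name lu_solve is hypothetical and the code is not the pMatlab or Linpack implementation. Production codes process the trailing matrix in blocks, which is where the matrix-matrix multiplies mentioned above arise; here the same update appears as a simple row-by-row loop.

```python
def lu_solve(a, b):
    """Solve A x = b by LU factorization with partial pivoting (dense, unblocked)."""
    n = len(a)
    lu = [row[:] for row in a]   # work on copies so the inputs are left untouched
    x = list(b)
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k to the diagonal.
        pivot = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[pivot] = lu[pivot], lu[k]
        x[k], x[pivot] = x[pivot], x[k]
        for i in range(k + 1, n):
            factor = lu[i][k] / lu[k][k]
            lu[i][k] = factor                       # multiplier stored below the diagonal (L)
            for j in range(k + 1, n):
                lu[i][j] -= factor * lu[k][j]       # trailing-matrix update (U)
            x[i] -= factor * x[k]                   # same row operation applied to b
    # Back substitution with the upper-triangular factor.
    for i in range(n - 1, -1, -1):
        partial = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - partial) / lu[i][i]
    return x
```

In a distributed version, the pivot search, row swap, and trailing update each touch sub-matrices owned by different processors, which is the communication pattern the paragraph above describes.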

The  FFT  kernel  performs  a  1-D  Fast  Fourier  Transform. 
The  1-D  FFT  is  performed  by  computing  two  2-D  FFTs, 
and  then  corner-turning  the  distributed  matrix  in  between 
the two computations. Both the local 2-D FFTs and large
matrix corner turns are among the most important
operations  in  multi-sensor  signal  processing. 
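
One common way to realize a large 1-D transform through a 2-D matrix is sketched below. It is an illustration under stated assumptions, not the authors' pMatlab code: the vector of length n1*n2 is viewed as an n1-by-n2 matrix, transforms are applied along one dimension, the result is scaled by twiddle factors, the matrix is corner-turned, and transforms are applied along the other dimension. The function names are hypothetical, and a naive O(n^2) DFT stands in for the local FFT so the example stays self-contained.

```python
import cmath


def dft(vec):
    """Naive discrete Fourier transform, standing in for a local FFT call."""
    n = len(vec)
    return [
        sum(vec[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def corner_turn(matrix):
    """Transpose a matrix stored as a list of rows (an all-to-all exchange when distributed)."""
    return [list(col) for col in zip(*matrix)]


def fft_via_2d(x, n1, n2):
    """Compute the length n1*n2 DFT of x using two passes of short transforms."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # View x as an n1-by-n2 row-major matrix and hold it by columns: cols[c][r] = x[n2*r + c].
    cols = [[x[n2 * r + c] for r in range(n1)] for c in range(n2)]
    # Pass 1: a length-n1 transform of every column (local if columns are distributed).
    cols = [dft(col) for col in cols]
    # Twiddle factors couple the two passes.
    cols = [
        [cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n) for k1 in range(n1)]
        for c in range(n2)
    ]
    # Corner turn so that the second dimension becomes contiguous.
    rows = corner_turn(cols)
    # Pass 2: a length-n2 transform of every row.
    rows = [dft(row) for row in rows]
    # Output element k = k1 + n1*k2 is stored at rows[k1][k2].
    return [rows[k % n1][k // n1] for k in range(n)]
```

In a distributed setting only the corner turn requires communication; both transform passes operate on data that is local to each processor.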

Results 

For  each  kernel  in  the  HPCchallenge,  we  examine  code 
size,  maximum  problem  size,  and  performance  on  a  Linux 
cluster  consisting  of  dual  3.0  GHz  Xeon  processors 
connected  with  Gigabit  Ethernet.  Examining  code  size,  we 
find pMatlab code to be approximately 10x smaller than
the  equivalent  C/F77+MPI  code.  Approximate  software 
lines  of  code  numbers  for  the  HPCchallenge  kernels  are 
shown  in  Table  1. 

The  maximum  problem  sizes  possible  using  pMatlab 
scale  linearly  with  the  number  of  processors  used  and  are 
comparable  to  the  corresponding  C/F77+MPI  code.  Figure 
2 illustrates this for the Top500 kernel. The maximum
input matrix size (m x n) on 16 processors (28K x 28K) is 16x
the  maximum  size  that  can  be  run  on  a  single  processor 
(7K  x  7K).  Figure  3  shows  the  performance  and 
maximum  problem  size  achieved  in  the  pMatlab  FFT  code 
relative  to  serial  MATLAB,  which  uses  FFTW  [4]  to 
implement  its  Fourier  Transform.  The  performance 
scalability is typical of that seen in C/F77+MPI
implementations.

Figure 2: Maximum input matrix data sizes are plotted for
the Top500 kernel. Each matrix contained real
double-precision data.
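
The 16x figure for the Top500 kernel follows directly from the matrix dimensions. The short calculation below is a sketch that assumes K means 1000 and that each real double-precision element occupies 8 bytes; it counts only the input matrix, not any workspace, and the function name is hypothetical.

```python
def matrix_bytes(rows, cols, bytes_per_element=8):
    """Storage for a dense matrix of real double-precision values."""
    return rows * cols * bytes_per_element


single_cpu = matrix_bytes(7_000, 7_000)        # largest matrix on one processor
sixteen_cpus = matrix_bytes(28_000, 28_000)    # largest matrix on 16 processors

growth = sixteen_cpus / single_cpu             # ratio of total problem sizes
per_cpu_share = sixteen_cpus / 16              # bytes held by each of the 16 processors
```

Under these assumptions the per-processor share of the 28K x 28K matrix equals the full 7K x 7K matrix, which is what linear scaling of problem size with processor count means.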


Figure  3:  Performance  (Flops)  and  scalability  results  are 
plotted  for  the  FFT  kernel.  Results  are  relative  to  the  serial 
MATLAB  performance.  Numbers  next  to  the  points  indicate 
the size of the complex vector used.

Table 1: C/Fortran + MPI vs. pMatlab software lines of
code for four of the HPCchallenge benchmarks.


Motivation and Goals


• Motivation

-  The  DARPA  HPCS  program  has  created  the  HPCchallenge 
benchmark  suite  in  an  effort  to  redefine  how  we  measure 
productivity  in  the  HPC  domain 

-  Implementing  the  HPCchallenge  benchmarks  using  pMatlab 
allows  a  unique  opportunity  to  explore  the  merits  of  pMatlab 
with  respect  to  HPEC 

•  Goals 

-  Compare  traditional  C/MPI  with  pMatlab.  Measurements  of 
productivity  include: 

•  Maximum  problem  size:  Largest  problem  that  can  be  solved  or  fit 
into  memory 

•  Execution  performance:  Run-time  performance  of  the  benchmark 

• Code size: Software lines of code (SLOC) required to implement
the benchmark


HPCchallenge Relevance to HPEC


• HPCchallenge benchmarks encompass key embedded
signal processing operations

- FFT: Distributed corner turn and FFTs important in
multi-sensor signal processing

- RandomAccess: Random data accesses typical of
“post detection” operations

- Top500: Matrix-matrix multiplies typical of
multi-element beamforming

- STREAM: Distributed vector operations common to
signal processing


FFT Results

• pMatlab memory scalability comparable to C/MPI
(128x on 128 CPUs)

• pMatlab execution performance comparable to C/MPI
(55x on 128 CPUs)

• pMatlab code size is 35x smaller than C/MPI
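
The scaling figures quoted for the FFT kernel can be put on a common footing by converting them to parallel efficiency. The snippet below is a small worked calculation, assuming each "Nx on 128 CPUs" figure is a ratio relative to a single CPU; the function name is hypothetical.

```python
def parallel_efficiency(scaling_factor, num_cpus):
    """Fraction of ideal linear scaling achieved on num_cpus processors."""
    return scaling_factor / num_cpus


fft_memory_efficiency = parallel_efficiency(128, 128)   # 1.0: problem size scales linearly
fft_speed_efficiency = parallel_efficiency(55, 128)     # about 0.43 of ideal speedup
```

Under that assumption, memory scales ideally while execution speed reaches a little under half of ideal, which is consistent with the corner turn being communication bound.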
Motivation


• The DARPA HPCS program has created the
HPCchallenge benchmark suite in an effort to
redefine how we measure productivity in the
HPC domain

• MATLAB® is the primary high-level language
used within the signal processing community;
increasingly used for

- large system simulations

- processing data in the field


Goals

• Implement and analyze the performance
of HPCchallenge benchmarks using
pMatlab

• Optimize and add functionality to the
pMatlab toolbox

• Compare traditional C/MPI with MATLAB
using global array semantics.

Measurements  of  productivity  include: 

-  Maximum  problem  size:  Largest  problem 
that  can  be  solved  or  fit  into  memory 

-  Execution  performance:  Run-time 
performance  of  the  benchmark 

-  Code  size:  Software  lines  of  code  (SLOC) 
required  to  implement  the  benchmark 


pMatlab  Software  Architecture 


[Diagram: software layers labeled Application, Parallel,
Messaging (MatlabMPI), and Math (MATLAB)]

• Can build a parallel library with a few messaging
primitives

• MatlabMPI provides this messaging capability:

MPI_Send(dest, comm, tag, X);

• Can build applications with a few parallel
structures and functions

• pMatlab provides parallel arrays and functions:

X = ones(n, mapX);

Y = zeros(n, mapY);

Y(:,:) = fft(X);

pMatlab  implements  global  array  semantics  in 
MATLAB 

-  Global  array  semantics  allow  indexing  and  general 
element  access  for  distributed  data 
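
The idea behind global array semantics can be illustrated with a toy example. The class below is a hypothetical sketch in plain Python, not pMatlab's data structure or API: it stores a 1-D array as contiguous blocks, one per simulated processor, and translates a global index into an owning rank and a local offset so that user code can index the array as if it were not distributed. Indices are 0-based here, unlike MATLAB.

```python
import bisect


class BlockVector:
    """A toy 1-D array stored as contiguous blocks, one per simulated processor."""

    def __init__(self, n, nprocs, fill=0.0):
        base, extra = divmod(n, nprocs)
        sizes = [base + (1 if rank < extra else 0) for rank in range(nprocs)]
        self.n = n
        self.offsets = [sum(sizes[:rank]) for rank in range(nprocs)]  # first global index per rank
        self.local = [[fill] * size for size in sizes]                # each rank's local storage

    def owner(self, i):
        """Map a global index to (owning rank, index within that rank's block)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        rank = bisect.bisect_right(self.offsets, i) - 1
        return rank, i - self.offsets[rank]

    def __getitem__(self, i):
        rank, j = self.owner(i)
        return self.local[rank][j]

    def __setitem__(self, i, value):
        rank, j = self.owner(i)
        self.local[rank][j] = value
```

For example, after v = BlockVector(10, 4), the assignment v[7] = 3.5 lands in the block held by rank 2 without the caller computing that location. In a real distributed setting the lookup in owner would decide whether the access is local or requires a message.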


Implementing  the  HPCchallenge  benchmarks  using 
pMatlab  allows  a  unique  opportunity  to  explore  the 
merits  of  pMatlab  with  respect  to  high  performance 
embedded  computing 


Local  Benchmarks 

•  DGEMM  (matrix  x  matrix 
multiply) 

•  STREAM 

-  COPY,  SCALE,  ADD,  TRIAD 

•  RandomAccess 

•  FFT 


Global  Benchmarks 

• Top500 (High Performance
LINPACK)

•  PTRANS  —  parallel  matrix 
transpose 

•  RandomAccess 

•  FFT 


• pMatlab memory scalability comparable to C/MPI
(128x on 128 CPUs)

• pMatlab execution performance comparable to C/MPI

• pMatlab code size is 6x smaller than the C/MPI
implementation


Top500 Results

• pMatlab maximum problem size scales 86x on 128 CPUs

• pMatlab execution performance scales 3x

- Removing index calculation overhead will
significantly improve performance

• pMatlab code size is 66x smaller than C/MPI
implementation


• pMatlab memory scalability comparable to C/MPI
(128x on 128 CPUs)

• pMatlab execution performance comparable to C/MPI
(128x on 128 CPUs)

• pMatlab code size is 8x smaller than C/MPI


HPCchallenge Relevance to HPEC

• Four key benchmarks have significant
relevance to HPEC

- FFT: Distributed corner turn and FFTs important
in multi-sensor signal processing

-  RandomAccess:  Random  data  accesses 
typical  of  “post  detection”  operations 

-  Top500:  Matrix-matrix  multiplies  typical  of 
multi-element  beamforming 

-  STREAM:  Distributed  vector  operations 
common  to  signal  processing 


• Multiple implementations

-  C/Fortran,  C/Fortran+MPI,  MATLAB,  pMatlab 


Conclusions 


Benchmark  Results  Summary 

•  Memory  scalability  comparable  to  C/MPI 
on  nearly  all  of  HPCchallenge  (for  128 
CPUs).  Allows  MATLAB  users  to  work  on 
much  larger  problems. 


•  Execution  performance  comparable  to 
C/MPI  on  nearly  all  of  HPCchallenge  (for 
128 CPUs). Allows MATLAB users to run
their programs much faster.


•  Code  size  much  smaller.  Allows  MATLAB 
users  to  write  programs  much  faster  than 
C/MPI 


•  pMatlab  allows  MATLAB  users  to 
effectively  exploit  parallel  computing,  and 
can  achieve  performance  comparable  to 
C/MPI. 


pMatlab  Goal:  Maps  and  Distributed  Matrices 


Benchmark  Platform 


HPCchallenge  Benchmark  Results:  C/MPI  vs.  pMatlab 